 composite optimization



Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization

Neural Information Processing Systems

Due to their simplicity and excellent performance, parallel asynchronous variants of stochastic gradient descent have become popular methods to solve a wide range of large-scale optimization problems on multi-core architectures. Yet, despite their practical success, support for nonsmooth objectives is still lacking, making them unsuitable for many problems of interest in machine learning, such as the Lasso, group Lasso or empirical risk minimization with convex constraints. In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse method inspired by SAGA, a variance reduced incremental gradient algorithm. The proposed method is easy to implement and significantly outperforms the state of the art on several nonsmooth, large-scale problems. We prove that our method achieves a theoretical linear speedup with respect to the sequential version under assumptions on the sparsity of gradients and block-separability of the proximal term. Empirical benchmarks on a multi-core architecture illustrate practical speedups of up to 12x on a 20-core machine.
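For readers unfamiliar with the building blocks named in this abstract, the following is a minimal sequential sketch of a proximal SAGA step applied to the Lasso. It is not the ProxASAGA algorithm itself: it is single-threaded, uses dense updates, and omits the sparse and asynchronous machinery the paper is about. The function names (`soft_threshold`, `lasso_objective`, `prox_saga`), the least-squares loss, and the step size are assumptions chosen for illustration.

```python
import random


def soft_threshold(v, t):
    """Proximal operator of t * |v| for a scalar v (the Lasso penalty)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_objective(A, b, x, lam):
    """(1/n) * sum_i 0.5 * (a_i . x - b_i)^2 + lam * ||x||_1."""
    n = len(A)
    residuals = [sum(a * xj for a, xj in zip(row, x)) - bi
                 for row, bi in zip(A, b)]
    return sum(r * r for r in residuals) / (2 * n) + lam * sum(abs(xj) for xj in x)


def prox_saga(A, b, lam, step, n_iter, seed=0):
    """Sequential proximal SAGA sketch for the Lasso (illustrative only)."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d
    # The gradient of f_i(x) = 0.5 * (a_i . x - b_i)^2 is r_i * a_i, so it is
    # enough to remember the scalar residual r_i. At x = 0 it equals -b_i.
    memory = [-bi for bi in b]
    # Running average of the remembered gradients.
    avg = [sum(memory[i] * A[i][j] for i in range(n)) / n for j in range(d)]
    for _ in range(n_iter):
        i = rng.randrange(n)
        r_new = sum(a * xj for a, xj in zip(A[i], x)) - b[i]
        delta = r_new - memory[i]
        for j in range(d):
            # Variance-reduced gradient estimate: fresh minus stale plus average.
            g_j = delta * A[i][j] + avg[j]
            # Gradient step followed by the proximal (soft-thresholding) step.
            x[j] = soft_threshold(x[j] - step * g_j, step * lam)
            # Keep the average consistent with the refreshed memory entry.
            avg[j] += delta * A[i][j] / n
        memory[i] = r_new
    return x
```

A call such as `prox_saga(A, b, lam=0.01, step=1.0 / (3 * L), n_iter=2000)`, with `L` taken as the largest squared row norm of `A`, is one reasonable way to try it on a small dense problem. The inner loop here touches every coordinate on every iteration, whereas the abstract's speedup result rests on sparse gradients and a block-separable proximal term.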



Composite Optimization Algorithms for Sigmoid Networks

Chen, Huixiong, Ye, Qi

arXiv.org Artificial Intelligence

In this paper, we use composite optimization algorithms to train sigmoid networks. We equivalently transform the sigmoid network problem into a convex composite optimization problem and propose composite optimization algorithms based on linearized proximal algorithms and the alternating direction method of multipliers. Under the assumptions of weak sharp minima and a regularity condition, the algorithm is guaranteed to converge to a globally optimal solution of the objective function even in the case of non-convex and non-smooth problems. Furthermore, the convergence results can be directly related to the amount of training data and provide a general guide for setting the size of sigmoid networks. Numerical experiments on Franke's function fitting and handwritten digit recognition show that the proposed algorithms perform satisfactorily and robustly.


A One-Sample Decentralized Proximal Algorithm for Non-Convex Stochastic Composite Optimization

Xiao, Tesi, Chen, Xuxing, Balasubramanian, Krishnakumar, Ghadimi, Saeed

arXiv.org Artificial Intelligence

We focus on decentralized stochastic non-convex optimization, where $n$ agents work together to optimize a composite objective function which is a sum of a smooth term and a non-smooth convex term. To solve this problem, we propose two single-time scale algorithms: Prox-DASA and Prox-DASA-GT. These algorithms can find $\epsilon$-stationary points in $\mathcal{O}(n^{-1}\epsilon^{-2})$ iterations using constant batch sizes (i.e., $\mathcal{O}(1)$). Unlike prior work, our algorithms achieve comparable complexity without requiring large batch sizes, more complex per-iteration operations (such as double loops), or stronger assumptions. Our theoretical findings are supported by extensive numerical experiments, which demonstrate the superiority of our algorithms over previous approaches. Our code is available at https://github.com/xuxingc/ProxDASA.


Federated Composite Saddle Point Optimization

Bai, Site, Bullins, Brian

arXiv.org Artificial Intelligence

Federated learning (FL) approaches for saddle point problems (SPP) have recently gained in popularity due to the critical role they play in machine learning (ML). Existing works mostly target smooth unconstrained objectives in Euclidean space, whereas ML problems often involve constraints or non-smooth regularization, which results in a need for composite optimization. Addressing these issues, we propose Federated Dual Extrapolation (FeDualEx), an extra-step primal-dual algorithm, which is the first of its kind that encompasses both saddle point optimization and composite objectives under the FL paradigm. Both the convergence analysis and the empirical evaluation demonstrate the effectiveness of FeDualEx in these challenging settings. In addition, even for the sequential version of FeDualEx, we provide rates for the stochastic composite saddle point setting which, to our knowledge, are not found in prior literature.


On Underdamped Nesterov's Acceleration

Chen, Shuo, Shi, Bin, Yuan, Ya-xiang

arXiv.org Artificial Intelligence

The high-resolution differential equation framework has proven to be tailor-made for Nesterov's accelerated gradient descent method (NAG) and its proximal counterpart, the class of faster iterative shrinkage-thresholding algorithms (FISTA). However, the theory is still incomplete, since the underdamped case ($r < 2$) has not been covered. In this paper, based on the high-resolution differential equation framework, we construct new Lyapunov functions for the underdamped case, motivated by the power of the time $t^{\gamma}$ or the iteration $k^{\gamma}$ in the mixed term. When the momentum parameter $r$ is $2$, the new Lyapunov functions are identical to the previous ones. The new proofs not only recover the convergence rate of the objective value previously obtained from the low-resolution differential equation framework but also characterize the convergence rate of the minimal squared gradient norm. All the convergence rates obtained for the underdamped case depend continuously on the parameter $r$. In addition, it is observed that the high-resolution differential equation approximately simulates the convergence behavior of NAG for the critical case $r=-1$, while the low-resolution differential equation degenerates to the conservative Newton's equation. The high-resolution differential equation framework also theoretically characterizes the convergence rates, which are consistent with those obtained for the underdamped case with $r=-1$.


Linear Convergence of ISTA and FISTA

Li, Bowen, Shi, Bin, Yuan, Ya-xiang

arXiv.org Artificial Intelligence

In this paper, we revisit the class of iterative shrinkage-thresholding algorithms (ISTA) for solving the linear inverse problem with sparse representation, which arises in signal and image processing. A numerical image-deblurring experiment shows that, with a logarithmic-scale ordinate, the convergence curve tends to be linear rather than logarithmic and nearly flat. On closer inspection, we find that the previous assumption that the smooth part is only convex weakens the least-squares model. Specifically, assuming the smooth part to be strongly convex is more reasonable for the least-squares model, even though the image matrix is probably ill-conditioned. Furthermore, we tighten the pivotal inequality for composite optimization, first found in [Li et al., 2022], by taking the smooth part to be strongly convex instead of generally convex. Based on this pivotal inequality, we generalize the linear convergence to composite optimization in both the objective value and the squared proximal subgradient norm. Meanwhile, in place of the original blur matrix, we construct a simple ill-conditioned matrix whose singular values are easy to compute. The new numerical experiment shows that the proximal generalization of Nesterov's accelerated gradient descent (NAG) for strongly convex functions has a faster linear convergence rate than ISTA. Based on the tighter pivotal inequality, we also generalize the faster linear convergence rate to composite optimization, in both the objective value and the squared proximal subgradient norm, by taking advantage of the well-constructed Lyapunov function with a slight modification and the phase-space representation based on the high-resolution differential equation framework from the implicit-velocity scheme.


Proximal Subgradient Norm Minimization of ISTA and FISTA

Li, Bowen, Shi, Bin, Yuan, Ya-xiang

arXiv.org Artificial Intelligence

For first-order smooth optimization, research on the acceleration phenomenon has a long history. Only recently was the mechanism leading to acceleration successfully uncovered through the gradient correction term and its equivalent implicit-velocity form. Furthermore, based on the high-resolution differential equation framework with the corresponding emerging techniques, phase-space representation and Lyapunov function, the squared gradient norm of Nesterov's accelerated gradient descent (NAG) method was found to converge at an inverse cubic rate. However, this result cannot be directly generalized to composite optimization widely used in practice, e.g., the linear inverse problem with sparse representation. In this paper, we closely examine a pivotal inequality used in composite optimization, involving the step size $s$ and the Lipschitz constant $L$, and find that it can be tightened. We apply the tighter inequality in the well-constructed Lyapunov function and then obtain the proximal subgradient norm minimization by the phase-space representation, regardless of gradient-correction or implicit-velocity. Furthermore, we demonstrate that the squared proximal subgradient norm for the class of iterative shrinkage-thresholding algorithms (ISTA) converges at an inverse square rate, and the squared proximal subgradient norm for the class of faster iterative shrinkage-thresholding algorithms (FISTA) is accelerated to convergence at an inverse cubic rate.
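As a concrete reference for the quantities mentioned in this abstract, the sketch below runs ISTA and FISTA on a small $\ell_1$-regularized least-squares problem and records the smallest squared proximal subgradient norm seen so far, taken here as $\|(y - \mathrm{prox}_{s\lambda\|\cdot\|_1}(y - s\nabla f(y)))/s\|^2$. It makes several assumptions not stated in the abstract: the smooth part is $f(x) = \tfrac{1}{2}\|Ax-b\|^2$, FISTA uses the classical Beck-Teboulle momentum sequence rather than the parametrization analyzed in the paper, and all function names are hypothetical.

```python
import math


def soft_threshold_vec(v, t):
    """Proximal operator of t * ||.||_1, applied coordinate-wise."""
    return [math.copysign(max(abs(vj) - t, 0.0), vj) for vj in v]


def grad_least_squares(A, b, x):
    """Gradient of f(x) = 0.5 * ||A x - b||^2, i.e. A^T (A x - b)."""
    residual = [sum(a * xj for a, xj in zip(row, x)) - bi
                for row, bi in zip(A, b)]
    return [sum(A[i][j] * residual[i] for i in range(len(A)))
            for j in range(len(x))]


def objective(A, b, x, lam):
    """0.5 * ||A x - b||^2 + lam * ||x||_1."""
    residual = [sum(a * xj for a, xj in zip(row, x)) - bi
                for row, bi in zip(A, b)]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(xj) for xj in x)


def prox_grad_step(A, b, y, lam, s):
    """One proximal gradient step from y; also returns ||G_s(y)||^2."""
    g = grad_least_squares(A, b, y)
    x_new = soft_threshold_vec([yj - s * gj for yj, gj in zip(y, g)], s * lam)
    sq_norm = sum(((yj - xj) / s) ** 2 for yj, xj in zip(y, x_new))
    return x_new, sq_norm


def ista(A, b, lam, s, n_iter):
    x = [0.0] * len(A[0])
    best = float("inf")
    for _ in range(n_iter):
        x, sq_norm = prox_grad_step(A, b, x, lam, s)
        best = min(best, sq_norm)
    return x, best


def fista(A, b, lam, s, n_iter):
    x = [0.0] * len(A[0])
    y, t = list(x), 1.0
    best = float("inf")
    for _ in range(n_iter):
        x_new, sq_norm = prox_grad_step(A, b, y, lam, s)
        best = min(best, sq_norm)
        # Beck-Teboulle momentum update (an assumption of this sketch).
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = [xn + (t - 1.0) / t_new * (xn - xo) for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
    return x, best
```

Both routines need a step size $s \le 1/L$, where $L$ is the largest eigenvalue of $A^\top A$. A safe, if conservative, choice that avoids an eigenvalue computation is the reciprocal of the squared Frobenius norm of $A$, since that norm bounds $L$ from above.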


Learning Distributionally Robust Models at Scale via Composite Optimization

Haddadpour, Farzin, Kamani, Mohammad Mahdi, Mahdavi, Mehrdad, Karbasi, Amin

arXiv.org Machine Learning

To train machine learning models that are robust to distribution shifts in the data, distributionally robust optimization (DRO) has been proven very effective. However, the existing approaches to learning a distributionally robust model either require solving complex optimization problems such as semidefinite programs or rely on a first-order method whose convergence scales linearly with the number of data samples, which hinders their scalability to large datasets. In this paper, we show how different variants of DRO are simply instances of a finite-sum composite optimization for which we provide scalable methods. We also provide empirical results that demonstrate the effectiveness of our proposed algorithm with respect to the prior art in order to learn robust models from very large datasets.